Extensions of linear regression models based on
set arithmetic for interval data

Extensions of previous linear regression models for interval data are
presented. A more flexible simple linear model is formalized. The new
model may express cross-relationships between mid-points and spreads
of the interval data in a single equation based on interval arithmetic.
Moreover, extensions to the multiple case are addressed. The associated
least-squares estimation problems are solved. Empirical results and a
real-life application are presented in order to show the applicability and
the differences among the proposed models.

keywords: multiple linear regression model; interval data; set arith- 
metic; least-squares estimation 



1 Introduction 



The statistical treatment of interval data is recently being considered ex- 
tensively (see P, El El HI El E]). Interval data are useful to model vari- 
ables with uncertainty in their formalization, due to an imprecise observation 
or an inexact measurement, fluctuations, grouped data or censoring. Linear
regression models for interval data have been previously analyzed (see
|3 El El EH El H21 H31 H H51). Regression models with interval- valued ex- 
planatory variables and interval-valued response are considered. There are 
two main approaches to face these kinds of problems. One is based on fitting 
separate models for mid-points and spreads (see [T3| E]). This approach has 
not been considered under probabilistic assumptions on the population models,
and inferential studies have not been developed yet. This is non-trivial,
since the non-negativity constraints satisfied by the spread variables prevent
the corresponding model from being treated as a classical linear regression.
Thus, although the usual fitting techniques are used, the associated inferences
are no longer valid. The second approach overcomes this difficulty by
considering a model based on the set arithmetic (see [9, 11]). The
least-squares estimators are found as solutions of constrained minimization
problems, and inferential studies have been developed in [10] and [12], among
others.

Extensions of the simple linear regression models within the framework of
the work in [9] and [11] are developed. On the one hand, a more flexible simple
linear model is formalized. The previous regression functions model the response
mid-points (respectively spreads) by means of the explanatory mid-points
(respectively spreads). The new model is able to accommodate cross-relationships
between mid-points and spreads in a single equation based on the set
arithmetic. Like the model in [11], the new one is based on the so-called
canonical decomposition of the intervals. On the other hand, extensions to the
multiple case are addressed. Due to the essential differences between the model
in [9] and those based on the canonical decomposition, two multiple models will
be introduced. The least-squares (LS) estimation problems associated with the
proposed regression models are solved. Some empirical results and a real-life
application are presented in order to show the applicability and the differences
among the proposed models.

The rest of the paper is organized as follows. In Section 2 some preliminary
concepts about the interval framework are presented and several previous simple
linear models based on the set arithmetic are reviewed. Extensions of those
linear models are introduced in Sections 3, 4 and 5. The theoretical
formalization and the associated LS estimation problems are addressed. In
Section 6 the empirical performance and the practical applicability of the
models are shown through some simulation studies and a real-life case study.
Finally, Section 7 includes some conclusions and future directions.

2 Preliminaries 

The considered interval experimental data are elements of the space
$\mathcal{K}_c(\mathbb{R}) = \{[a_1, a_2] : a_1, a_2 \in \mathbb{R},\ a_1 \le a_2\}$.
Each interval $A \in \mathcal{K}_c(\mathbb{R})$ can be parametrized in terms of
its mid-point, $\operatorname{mid} A = (\sup A + \inf A)/2$, and its spread,
$\operatorname{spr} A = (\sup A - \inf A)/2$. The notation
$A = [\operatorname{mid} A \pm \operatorname{spr} A]$ will be used. An alternative
representation for intervals is the so-called canonical decomposition, given by
$A = \operatorname{mid} A\,[1 \pm 0] + \operatorname{spr} A\,[0 \pm 1]$. It
allows the consideration of the mid and spr components of A separately within
the interval arithmetic.

The Minkowski addition and the product by scalars form the natural arithmetic
on $\mathcal{K}_c(\mathbb{R})$. In terms of the (mid, spr)-representation these
operations can be jointly expressed as

$$A + \lambda B = [(\operatorname{mid} A + \lambda \operatorname{mid} B) \pm (\operatorname{spr} A + |\lambda| \operatorname{spr} B)]$$

for any $A, B \in \mathcal{K}_c(\mathbb{R})$ and $\lambda \in \mathbb{R}$. The space
$(\mathcal{K}_c(\mathbb{R}), +, \cdot)$ is not linear but semilinear (or conical),
due to the lack of a symmetric element with respect to the addition.
$\mathcal{K}_c(\mathbb{R})$ can be identified with the cone
$\mathbb{R} \times \mathbb{R}^+$ of $\mathbb{R}^2$. The expression $A + (-1)B$
generally differs from the natural difference $A - B$. If there exists
$C \in \mathcal{K}_c(\mathbb{R})$ verifying that $A = B + C$, then $C = A -_H B$
is called the Hukuhara difference between the pair of intervals A and B. The
interval C exists iff $\operatorname{spr} B \le \operatorname{spr} A$.

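For every $A, B \in \mathcal{K}_c(\mathbb{R})$, the $L_2$-type generic metric in [16] is defined as

$$d_\theta(A, B) = \left((\operatorname{mid} A - \operatorname{mid} B)^2 + \theta\,(\operatorname{spr} A - \operatorname{spr} B)^2\right)^{1/2} \tag{1}$$

for an arbitrary $\theta \in (0, \infty)$.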
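
The interval arithmetic, the Hukuhara difference and the metric (1) can be illustrated with a minimal Python sketch. It is not part of the original work: the class name `Interval` and the names `from_bounds`, `scale`, `hukuhara_diff` and `d_theta` are hypothetical, chosen only for this illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Compact real interval stored as (mid, spr) with spr >= 0."""
    mid: float
    spr: float

    def __post_init__(self):
        if self.spr < 0:
            raise ValueError("spread must be non-negative")

    @classmethod
    def from_bounds(cls, lo, hi):
        # [lo, hi] -> (mid, spr)
        return cls((lo + hi) / 2.0, (hi - lo) / 2.0)

    def __add__(self, other):
        # Minkowski sum: mids add and spreads add.
        return Interval(self.mid + other.mid, self.spr + other.spr)

    def scale(self, lam):
        # Product by a scalar: the spread is multiplied by |lam|.
        return Interval(lam * self.mid, abs(lam) * self.spr)

    def hukuhara_diff(self, other):
        """Return C with self = other + C, or None when C does not exist."""
        if other.spr > self.spr:
            return None
        return Interval(self.mid - other.mid, self.spr - other.spr)


def d_theta(a, b, theta=1.0):
    """Metric (1): sqrt((mid a - mid b)^2 + theta * (spr a - spr b)^2)."""
    return math.sqrt((a.mid - b.mid) ** 2 + theta * (a.spr - b.spr) ** 2)
```

For A = [1, 5] and B = [0, 2], `a + b.scale(-1)` has spread 3 whereas the Hukuhara difference has spread 1, which shows why $A + (-1)B$ and $A -_H B$ differ.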

Given a probability space $(\Omega, \mathcal{A}, P)$, the mapping
$x : \Omega \to \mathcal{K}_c(\mathbb{R})$ is said to be a random interval iff
$\operatorname{mid} x, \operatorname{spr} x : \Omega \to \mathbb{R}$ are real
random variables and $\operatorname{spr} x \ge 0$. Random intervals will be
denoted with bold lowercase letters, x, random interval-valued vectors will be
represented by non-bold lowercase letters, x, and interval-valued matrices will
be denoted with uppercase letters, X.

The expected value of x is defined in terms of the well-known Aumann
expectation, which satisfies

$$E(x) = [E(\operatorname{mid} x) \pm E(\operatorname{spr} x)], \tag{2}$$

whenever $\operatorname{mid} x, \operatorname{spr} x \in L^1$. The variance of a
random interval x can be defined as the usual Fréchet variance (see [17])
associated with the Aumann expectation in the metric space
$(\mathcal{K}_c(\mathbb{R}), d_\theta)$, i.e.

$$\sigma_x^2 = E\left(d_\theta^2(x, E(x))\right) = \sigma_{\operatorname{mid} x}^2 + \theta\,\sigma_{\operatorname{spr} x}^2,$$

whenever $\operatorname{mid} x, \operatorname{spr} x \in L^2$. The conical
structure of the space $\mathcal{K}_c(\mathbb{R})$ entails some differences when
defining the usual covariance (see [18]). In terms of the $d_\theta$ metric it
has the expression

$$\sigma_{x,y} = \sigma_{\operatorname{mid} x, \operatorname{mid} y} + \theta\,\sigma_{\operatorname{spr} x, \operatorname{spr} y},$$

whenever those classical covariances exist. The expression Cov(x, y) denotes
the covariance matrix between two random interval-valued vectors
$x = (x_1, \ldots, x_k)$ and $y = (y_1, \ldots, y_k)$.
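
As a hedged illustration, the sample counterparts of the Aumann mean (2), the variance and the covariance can be computed from (mid, spr) pairs as below. The function names are hypothetical and the divisor n (rather than n - 1) is an assumption of this sketch, not something fixed by the text.

```python
def mean(values):
    return sum(values) / len(values)


def cov(u, v):
    """Sample covariance (divisor n) of two equally long real sequences."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)


def aumann_mean(sample):
    """Sample version of (2); each interval is a (mid, spr) pair."""
    return (mean([m for m, _ in sample]), mean([s for _, s in sample]))


def interval_cov(xs, ys, theta=1.0):
    """cov(mid x, mid y) + theta * cov(spr x, spr y)."""
    return (cov([m for m, _ in xs], [m for m, _ in ys])
            + theta * cov([s for _, s in xs], [s for _, s in ys]))


def interval_var(xs, theta=1.0):
    """Frechet-type variance: var(mid x) + theta * var(spr x)."""
    return interval_cov(xs, xs, theta)
```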

Let $x, y : \Omega \to \mathcal{K}_c(\mathbb{R})$ be two random intervals. The
basic simple linear model (see [9]) to relate two random intervals has the
expression

$$y = b\,x + \varepsilon \tag{3}$$

with $b \in \mathbb{R}$ and $\varepsilon : \Omega \to \mathcal{K}_c(\mathbb{R})$ an
interval-valued random error such that
$E[\varepsilon \mid x] = \Delta \in \mathcal{K}_c(\mathbb{R})$. The LS estimation
of (3) has been solved analytically by means of a constrained minimization
problem in [9].

Model (3) only involves one regression parameter b to model the dependency.
Thus, it induces quite restrictive separate models for the mid and
spr components of the intervals. Specifically,
$\operatorname{mid} y = b \operatorname{mid} x + \operatorname{mid} \varepsilon$ and
$\operatorname{spr} y = |b| \operatorname{spr} x + \operatorname{spr} \varepsilon$.

A more flexible linear model, called model M, has been introduced in [11].
It is defined in terms of the canonical decomposition as follows:

$$y = b_1 \operatorname{mid} x\,[1 \pm 0] + b_2 \operatorname{spr} x\,[0 \pm 1] + \gamma\,[1 \pm 0] + \varepsilon, \tag{4}$$

where $b_1, b_2 \in \mathbb{R}$ are the regression coefficients,
$\gamma \in \mathbb{R}$ is an intercept term influencing the mid component of y
and $\varepsilon$ is a random interval error satisfying
$E[\varepsilon \mid x] = [-\delta, \delta]$, with $\delta \ge 0$. From (4) the
linear relationships
$\operatorname{mid} y = b_1 \operatorname{mid} x + \gamma + \operatorname{mid} \varepsilon$ and
$\operatorname{spr} y = |b_2| \operatorname{spr} x + \operatorname{spr} \varepsilon$
are transferred, where $b_1$ and $b_2$ may be different. The LS estimation leads
to analytic expressions of the regression parameters of model M (see [11]).
Confidence sets based on those estimators have been developed in [12].



3 A flexible simple linear regression model: the
model $M_G$

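Following (4), the model $M_G$ between x and y is defined as:

$$y = b_1 \operatorname{mid} x\,[1 \pm 0] + b_2 \operatorname{spr} x\,[0 \pm 1] + b_3 \operatorname{mid} x\,[0 \pm 1] + b_4 \operatorname{spr} x\,[1 \pm 0] + \gamma\,[1 \pm 0] + \varepsilon, \tag{5}$$

where $b_i, \gamma \in \mathbb{R}$, $i = 1, \ldots, 4$, and
$E(\varepsilon \mid x) = [-\delta, \delta] \in \mathcal{K}_c(\mathbb{R})$,
$\delta \ge 0$. The linear relationships for the mid and spr variables
transferred from (5) are

$$\operatorname{mid} y = b_1 \operatorname{mid} x + b_4 \operatorname{spr} x + \gamma + \operatorname{mid} \varepsilon$$

and

$$\operatorname{spr} y = |b_2| \operatorname{spr} x + |b_3|\,|\operatorname{mid} x| + \operatorname{spr} \varepsilon.$$

Thus, both variables mid y and spr y are modelled from the complete information
provided by the independent random interval x, characterized by the
random vector (mid x, spr x).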

For a simpler notation, the random intervals defined from x are denoted
by $x^M$, $x^S$, $x^C$ and $x^R$, in the same order as they appear in (5). Thus,
the model $M_G$ is equivalently expressed as:

$$y = b_1 x^M + b_2 x^S + b_3 x^C + b_4 x^R + \gamma\,[1 \pm 0] + \varepsilon.$$

Moreover, in order to unify the notation for the estimation problems of the
different linear models, the real interval
$\Delta = [\gamma - \delta, \gamma + \delta]$ is defined. Then,
the regression function associated with the model $M_G$ can be written as:

$$E(y \mid x) = b_1 x^M + b_2 x^S + b_3 x^C + b_4 x^R + \Delta. \tag{6}$$

Since $x^S = -x^S$ and $x^C = -x^C$, the model $M_G$ always admits four
equivalent expressions. This property allows the simplification of the
estimation process, because it is possible to search only for non-negative
estimates of the parameters $b_2$ and $b_3$.
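
The regression function (6) can be evaluated in (mid, spr) coordinates, as in the following sketch. It assumes intervals are passed as (mid, spr) pairs; the function name `predict_mg` is hypothetical. The last lines of the usage show that flipping the sign of $b_2$ or $b_3$ leaves the prediction unchanged, which is the equivalence noted above.

```python
def predict_mg(x, b1, b2, b3, b4, delta):
    """Regression function (6) for one interval x = (mid, spr).

    delta is the (mid, spr) pair of the interval parameter Delta.
    Returns the (mid, spr) pair of b1*xM + b2*xS + b3*xC + b4*xR + Delta.
    """
    mid_x, spr_x = x
    # (mid, spr) of the four intervals built from x in (5)
    x_m = (mid_x, 0.0)        # mid x [1 +/- 0]
    x_s = (0.0, spr_x)        # spr x [0 +/- 1]
    x_c = (0.0, abs(mid_x))   # mid x [0 +/- 1]
    x_r = (spr_x, 0.0)        # spr x [1 +/- 0]
    mid, spr = delta
    for coef, (m, s) in zip((b1, b2, b3, b4), (x_m, x_s, x_c, x_r)):
        mid += coef * m           # mids scale with the coefficient
        spr += abs(coef) * s      # spreads scale with its absolute value
    return mid, spr


# Usage: the sign of b2 and b3 does not change the prediction.
p1 = predict_mg((-2.0, 1.0), 0.5, 2.0, -1.0, 3.0, (1.0, 0.5))
p2 = predict_mg((-2.0, 1.0), 0.5, -2.0, 1.0, 3.0, (1.0, 0.5))
```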

Given a random sample $\{(x_i, y_i)\}_{i=1}^n$ obtained from two random intervals
(x, y) verifying (5), the LS estimation of the parameters
$(b_1, b_2, b_3, b_4, \Delta)$ in (6) consists in minimizing

$$\frac{1}{n} \sum_{i=1}^n d_\theta^2\left(y_i,\ a x_i^M + b x_i^S + c x_i^C + d x_i^R + C\right) \tag{7}$$

over $(a, b, c, d, C) \in \mathbb{R}^4 \times \mathcal{K}_c(\mathbb{R})$. However,
since from equation (5)
$\varepsilon_i = y_i -_H (b_1 x_i^M + b_2 x_i^S + b_3 x_i^C + b_4 x_i^R)$, (7) must
be optimized over a suitable feasible set assuring the existence of the sample
residuals, i.e., the corresponding Hukuhara differences. Note that

$$\operatorname{spr}\left(a x_i^M + b x_i^S + c x_i^C + d x_i^R\right) = |b| \operatorname{spr} x_i + |c|\,|\operatorname{mid} x_i|$$

for all $i = 1, \ldots, n$, and $b_2$ and $b_3$ can be assumed to be non-negative.
Then, taking into account the condition guaranteeing the existence of the
Hukuhara difference, the feasible set can be expressed as

$$\Gamma_G = \{(b, c) \in [0, \infty) \times [0, \infty) : b \operatorname{spr} x_i + c\,|\operatorname{mid} x_i| \le \operatorname{spr} y_i,\ \forall i = 1, \ldots, n\}. \tag{8}$$
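
Membership in the feasible set (8) is a simple row-wise check. The sketch below assumes samples are lists of (mid, spr) pairs; the function names and the numerical tolerance `tol` are choices of this illustration, and the data in the usage lines are illustrative values only.

```python
def in_feasible_set(b, c, xs, ys, tol=1e-12):
    """Check whether (b, c) belongs to Gamma_G in (8)."""
    if b < 0 or c < 0:
        return False
    return all(b * sx + c * abs(mx) <= sy + tol
               for (mx, sx), (_, sy) in zip(xs, ys))


def axis_bounds(xs, ys):
    """Largest feasible c on the axis b = 0 and largest feasible b on c = 0."""
    r0 = min((sy / abs(mx) for (mx, _), (_, sy) in zip(xs, ys) if mx != 0),
             default=float("inf"))
    s0 = min((sy / sx for (_, sx), (_, sy) in zip(xs, ys) if sx != 0),
             default=float("inf"))
    return r0, s0


# Usage with three illustrative (mid, spr) pairs
xs = [(145.5, 27.5), (165.5, 46.5), (132.5, 28.5)]
ys = [(82.5, 19.5), (70.0, 23.0), (94.5, 23.5)]
r0, s0 = axis_bounds(xs, ys)
```

The two bounds returned by `axis_bounds` are the end points of the feasible set on each axis; any feasible (b, c) satisfies $c \le r_0$ and $b \le s_0$.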

If $(\hat b_1, \hat b_2, \hat b_3, \hat b_4) \in \mathbb{R}^4$ denotes a feasible
estimate, then the interval parameter $\Delta$ can be directly estimated by

$$\hat\Delta = \overline{y} -_H \left(\hat b_1 \overline{x^M} + \hat b_2 \overline{x^S} + \hat b_3 \overline{x^C} + \hat b_4 \overline{x^R}\right).$$

As a result, the LS minimization problem is

$$\min_{(a,d) \in \mathbb{R}^2,\ (b,c) \in \Gamma_G} \frac{1}{n} \sum_{i=1}^n d_\theta^2\left(y_i -_H (a x_i^M + b x_i^S + c x_i^C + d x_i^R),\ \overline{y} -_H (a \overline{x^M} + b \overline{x^S} + c \overline{x^C} + d \overline{x^R})\right). \tag{9}$$

The problem (9) can be solved separately for (a, d) and (b, c). The minimization
over (a, d) is done without restrictions and it leads to the following analytic
estimators of $(b_1, b_4)$ in the model $M_G$:

$$(\hat b_1, \hat b_4)^t = S_1^{-1} z_1. \tag{10}$$

Here $z_1 = (\hat\sigma_{x^M, y}, \hat\sigma_{x^R, y})^t$ and $S_1$ corresponds to
the sample covariance matrix of the interval-valued random vector $(x^M, x^R)$.

The minimization for (b, c) is performed over the feasible set $\Gamma_G$, which
is nonempty, closed and convex. The objective function to be minimized over
(b, c) can be expressed as the globally convex function

$$g(b, c) = b^2 \hat\sigma_{x^S}^2 + c^2 \hat\sigma_{x^C}^2 + 2bc\,\hat\sigma_{x^S, x^C} - 2b\,\hat\sigma_{x^S, y} - 2c\,\hat\sigma_{x^C, y}. \tag{11}$$
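
A minimal sketch of the unconstrained part follows. It relies on the observation that $x^M$ and $x^R$ have zero spread and $x^S$ and $x^C$ have zero mid-point, so the interval covariances in (10) and (11) reduce to classical covariances of mid x, spr x and |mid x|. The function names, the dictionary keys and the divisor n in the sample covariance are assumptions of this illustration.

```python
def cov(u, v):
    """Sample covariance (divisor n) of two real sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)


def estimate_b1_b4(mid_x, spr_x, mid_y):
    """Solve S1 (b1, b4)^t = z1 from (10) by Cramer's rule (2x2 system)."""
    s11, s12, s22 = cov(mid_x, mid_x), cov(mid_x, spr_x), cov(spr_x, spr_x)
    z1, z2 = cov(mid_x, mid_y), cov(spr_x, mid_y)
    det = s11 * s22 - s12 * s12
    if abs(det) < 1e-12:
        raise ValueError("S1 is singular")
    return (s22 * z1 - s12 * z2) / det, (s11 * z2 - s12 * z1) / det


def g_moments(mid_x, spr_x, spr_y, theta=1.0):
    """Moments entering g in (11): theta-weighted covariances of spreads."""
    abs_mid = [abs(m) for m in mid_x]
    return {
        "ss": theta * cov(spr_x, spr_x),      # variance of x^S
        "cc": theta * cov(abs_mid, abs_mid),  # variance of x^C
        "sc": theta * cov(spr_x, abs_mid),    # covariance of x^S and x^C
        "sy": theta * cov(spr_x, spr_y),      # covariance of x^S and y
        "cy": theta * cov(abs_mid, spr_y),    # covariance of x^C and y
    }


def g(b, c, m):
    """Objective function (11)."""
    return (b * b * m["ss"] + c * c * m["cc"] + 2 * b * c * m["sc"]
            - 2 * b * m["sy"] - 2 * c * m["cy"])
```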



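If the global minimum of the function g is such that
$(b^*, c^*)^t \notin \Gamma_G$, then the local minimum of g over $\Gamma_G$ is
unique, and it is located on the boundary of $\Gamma_G$. The boundary of
$\Gamma_G$, denoted by $\operatorname{fr}(\Gamma_G)$, verifies that

$$\operatorname{fr}(\Gamma_G) = L_1 \cup L_2 \cup L_3, \tag{12}$$

where $L_i$, $i = 1, 2, 3$, are the following sets:

• $L_1 = \{(0, c) \mid 0 \le c \le r_0\}$, with $r_0 = \min_{i=1,\ldots,n} \operatorname{spr} y_i / |\operatorname{mid} x_i|$,

• $L_3 = \{(b, 0) \mid 0 \le b \le s_0\}$, with $s_0 = \min_{i=1,\ldots,n} \operatorname{spr} y_i / \operatorname{spr} x_i$,

• $L_2 = \{(b, \min_{k=1,\ldots,n}\{-u_k b + v_k\}) \mid 0 \le b \le s_0\}$, with
  $u_k = \operatorname{spr} x_k / |\operatorname{mid} x_k|$ and
  $v_k = \operatorname{spr} y_k / |\operatorname{mid} x_k|$ for all $k = 1, \ldots, n$.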



The set $L_2$ is composed of several straight segments from some of the
straight lines $\{l_k : c = -u_k b + v_k\}_{k=1}^n$. If
$|\operatorname{mid} x_k| = 0$ for any $k \in \{1, \ldots, n\}$, then the
corresponding straight line is $b = \operatorname{spr} y_k / \operatorname{spr} x_k$
for $\operatorname{spr} x_k \ne 0$. Thus, it is a vertical line, which could take
part in $L_2$ only if $\operatorname{spr} y_k / \operatorname{spr} x_k = s_0$.
Moreover, if $\operatorname{spr} x_k = 0$ too, then the sample interval $x_k$ is
reduced to the real value $x_k = 0$, so it does not take part in the
construction of $\Gamma_G$. In Figure 1 the feasible set and its boundary in a
practical example are illustrated graphically. The sample data correspond to a
real-life example (see Section 6.1).



In order to find the exact solution of $\min_{(b,c) \in \Gamma_G} g(b, c)$, the
global minimum of g should be computed and, if needed, the local minimum over
$L_i$, $i = 1, 2, 3$ (see Algorithm 1).

The asymptotic time complexity of the algorithm is $O(nt)$, where t is the
number of lines in $\{l_k\}_{k=1}^n$ taking part in $\operatorname{fr}(\Gamma_G)$.
The straight lines in $\{l_k : c = -u_k b + v_k,\ k \ne (v), (h)\}_{k=1}^n$ such
that $-u_k b_{(v,h)} + v_k > c_{(v,h)}$ do not take part in the construction of
$\operatorname{fr}(\Gamma_G)$. Thus, they can be ignored from Step 5
to the end of the algorithm. However, for practical examples with moderate
sample sizes n, this reduction will result in a negligible improvement in the
computational efficiency of the algorithm.
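
For small samples the constrained minimum of g can also be cross-checked by brute force. The sketch below is not Algorithm 1: it is a simpler alternative with cubic cost in n that enumerates the unconstrained minimizer, the minimizer of g along every constraint line, and every pairwise intersection of those lines, and keeps the best feasible candidate. The function name, the dictionary keys (`ss`, `cc`, `sc`, `sy`, `cy`, standing for the moments in (11)) and the tolerance are assumptions of this illustration.

```python
def minimize_g(m, xs, ys, tol=1e-9):
    """Brute-force minimizer of g over Gamma_G; xs, ys hold (mid, spr) pairs."""
    def g(b, c):
        return (b * b * m["ss"] + c * c * m["cc"] + 2 * b * c * m["sc"]
                - 2 * b * m["sy"] - 2 * c * m["cy"])

    # Boundary lines p*b + q*c = r: both axes, then one line per observation.
    axes = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
    cons = [(sx, abs(mx), sy) for (mx, sx), (_, sy) in zip(xs, ys)
            if sx > 0 or mx != 0]
    lines = axes + cons

    def feasible(b, c):
        return (b >= -tol and c >= -tol
                and all(p * b + q * c <= r + tol for p, q, r in cons))

    cands = []
    det = m["ss"] * m["cc"] - m["sc"] ** 2
    if det > tol:  # unconstrained minimizer of g
        cands.append(((m["cc"] * m["sy"] - m["sc"] * m["cy"]) / det,
                      (m["ss"] * m["cy"] - m["sc"] * m["sy"]) / det))
    for p, q, r in lines:
        n2 = p * p + q * q
        b0, c0 = p * r / n2, q * r / n2   # a point on the line
        db, dc = q, -p                    # direction of the line
        curv = db * db * m["ss"] + dc * dc * m["cc"] + 2 * db * dc * m["sc"]
        if curv > tol:                    # 1-D minimizer of g along the line
            slope = (db * (b0 * m["ss"] + c0 * m["sc"] - m["sy"])
                     + dc * (c0 * m["cc"] + b0 * m["sc"] - m["cy"]))
            t = -slope / curv
            cands.append((b0 + t * db, c0 + t * dc))
    for i, (p1, q1, r1) in enumerate(lines):
        for p2, q2, r2 in lines[i + 1:]:  # vertices: pairwise intersections
            d = p1 * q2 - p2 * q1
            if abs(d) > tol:
                cands.append(((r1 * q2 - r2 * q1) / d,
                              (p1 * r2 - p2 * r1) / d))
    return min((pt for pt in cands if feasible(*pt)), key=lambda pt: g(*pt))
```

The enumeration is valid because a convex quadratic restricted to a convex polygon attains its minimum at the unconstrained minimizer when that point is feasible, and otherwise on an edge, where it is either the minimizer along the supporting line or a vertex.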



4 The multiple basic linear regression model 

Let y be a response random interval and let $x_1, x_2, \ldots, x_k$ be k
explanatory random intervals. The multiple basic linear regression model
(MBLRM) extending (3) is formalized as:

$$y = x^t b + \varepsilon \tag{13}$$

with $x = (x_1, x_2, \ldots, x_k)^t$, $b = (b_1, b_2, \ldots, b_k)^t \in \mathbb{R}^k$
and $\varepsilon$ a random interval-valued error such that
$E(\varepsilon \mid x) = \Delta \in \mathcal{K}_c(\mathbb{R})$. The associated
regression function is $E(y \mid x_1, \ldots, x_k) = x^t b + \Delta$. Hereafter,
the second-order moments of the random intervals involved in the linear model
(13) are assumed to be finite, and the variances strictly positive. If the mids
and spreads of the explanatory intervals are not degenerate, then (13) is
unique. The following separate models are transferred:

$$\operatorname{mid} y = \operatorname{mid}(x^t)\, b + \operatorname{mid} \varepsilon, \quad \text{and} \quad \operatorname{spr} y = \operatorname{spr}(x^t)\, |b| + \operatorname{spr} \varepsilon. \tag{14}$$

The mid variables are related through a standard (real-valued) multiple linear
model, but this is not the case for the spreads, due to the non-negativity
restrictions.

Let $\{(y_j, x_{i,j}) : i = 1, \ldots, k,\ j = 1, \ldots, n\}$ be a simple random
sample of size n obtained from y and $x = (x_1, \ldots, x_k)$ verifying (13). Then

$$y = X b + \varepsilon, \tag{15}$$

where $y = (y_1, \ldots, y_n)^t$, X is the $(n \times k)$ interval-valued matrix
such that $X_{j,i} = x_{i,j}$, and
$\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^t$ fulfils
$E(\varepsilon \mid X) = 1^n \Delta$, $1^n$ denoting the ones' vector in
$\mathbb{R}^n$. The LS estimation consists in finding $\hat b$ and $\hat\Delta$
solving

$$\min_{d \in \mathbb{R}^k,\ C \in \mathcal{K}_c(\mathbb{R})} d_\theta^2\left(y, X d + 1^n C\right), \tag{16}$$

constrained to the existence of the residuals $\varepsilon = y -_H X b$. If
$\hat b \in \Gamma = \{a \in \mathbb{R}^k : y -_H X a \in \mathcal{K}_c(\mathbb{R})^n\}$,
then the optimum value over C is attained at

$$\hat\Delta = \overline{y} -_H \overline{x}^t\, \hat b. \tag{17}$$



Extending directly the estimation method in [2] would lead to a computationally
unfeasible combinatorial problem. For that reason, a non-optimal stepwise
algorithm has been proposed. However, that drawback may be offset by estimating
separately the absolute value of b and its sign. Note that
$b = |b| \circ \operatorname{sign}(b)$, and from (14), sign(b) is only determined
by the sign of the relationship between the mid-points. Then,
$\operatorname{sign}(b)_i = \operatorname{sign}(\operatorname{Cov}(\operatorname{mid} y, \operatorname{mid} x_i))$,
and $|b|$ can be obtained as the solution of

$$\min_{a \in \Gamma,\ a \ge 0} d_\theta^2\left(y, X(a \circ \operatorname{sign}(b)) + 1^n \hat\Delta\right). \tag{18}$$

The feasible set $\Gamma' = \Gamma \cap (\mathbb{R}^k)^+$ in (18) can be expressed as

$$\Gamma' = \{d \in (\mathbb{R}^k)^+ : (\operatorname{spr} X)\, d \le \operatorname{spr} y\}. \tag{19}$$
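
The two ingredients of this step, the sign vector and the feasible set (19), can be sketched as follows. The function names are hypothetical, mid-points are passed column-wise and spreads row-wise, and a zero covariance is mapped to the sign +1, which is an assumption of this illustration.

```python
def cov(u, v):
    """Sample covariance (divisor n) of two real sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)


def coefficient_signs(mid_y, mid_x_columns):
    """sign(b)_i = sign(Cov(mid y, mid x_i)), one entry per explanatory variable."""
    return [1.0 if cov(mid_y, col) >= 0 else -1.0 for col in mid_x_columns]


def in_gamma_prime(d, spr_x_rows, spr_y, tol=1e-12):
    """Check d in Gamma' from (19): d >= 0 and (spr X) d <= spr y row-wise."""
    if any(v < 0 for v in d):
        return False
    return all(sum(s * v for s, v in zip(row, d)) <= sy + tol
               for row, sy in zip(spr_x_rows, spr_y))
```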



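A more operative expression for the objective function in (18) is:

$$d_\theta^2\left(y, X(a \circ \operatorname{sign}(b)) + 1^n \hat\Delta\right) = (v_m - F'_m a)^t (v_m - F'_m a) + \theta\,(v_s - F'_s a)^t (v_s - F'_s a), \tag{20}$$

where $v_m = \operatorname{mid} y - \overline{\operatorname{mid} y}\, 1^n \in \mathbb{R}^n$,
$v_s = \operatorname{spr} y - \overline{\operatorname{spr} y}\, 1^n \in \mathbb{R}^n$,
$F_m = \operatorname{mid} X - 1^n (\overline{\operatorname{mid} x}^t)$,
$F_s = \operatorname{spr} X - 1^n (\overline{\operatorname{spr} x}^t)$,
$F'_m = F_m \operatorname{diag}(\operatorname{sign}(b)_1, \ldots, \operatorname{sign}(b)_k) \in \mathbb{R}^{n \times k}$,
$F'_s = F_s \operatorname{diag}(\operatorname{sign}(b)_1, \ldots, \operatorname{sign}(b)_k) \in \mathbb{R}^{n \times k}$
and $a \in \mathbb{R}^k$. Since the optimization problem consists in minimizing a
quadratic expression with linear inequality constraints, the
Karush-Kuhn-Tucker (KKT) conditions guarantee the existence of a solution,
which can be found by using standard software.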



5 The multiple flexible linear regression model 



From (5), a multiple flexible linear regression model (MFLRM) can be defined
as:

$$y = \operatorname{mid} x^t\,[1 \pm 0]\, b^1 + \operatorname{spr} x^t\,[0 \pm 1]\, b^2 + \operatorname{mid} x^t\,[0 \pm 1]\, b^3 + \operatorname{spr} x^t\,[1 \pm 0]\, b^4 + \varepsilon, \tag{21}$$

where $b^1, b^2, b^3, b^4 \in \mathbb{R}^k$ and
$E(\varepsilon \mid x) = \Delta \in \mathcal{K}_c(\mathbb{R})$. Equivalently, (21)
can be written as:

$$y = x^M b^1 + x^S b^2 + x^C b^3 + x^R b^4 + \varepsilon, \tag{22}$$

or, in matrix notation, as:

$$y = X^{Bl} B + \varepsilon, \tag{23}$$

where $X^{Bl} = (x^M \mid x^S \mid x^C \mid x^R) \in \mathcal{K}_c(\mathbb{R})^{1 \times 4k}$
and $B = ((b^1)^t \mid (b^2)^t \mid (b^3)^t \mid (b^4)^t)^t \in \mathbb{R}^{4k \times 1}$.
The values $b^2$ and $b^3$ can be assumed to be non-negative without loss
of generality since $x^S = -x^S$ and $x^C = -x^C$.



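The separate linear relationships for the mid and spr components of the
intervals transferred from (21) are

$$\operatorname{mid} y = \operatorname{mid}(x^t)\, b^1 + \operatorname{spr}(x^t)\, b^4 + \operatorname{mid} \varepsilon, \quad \text{and} \tag{24}$$

$$\operatorname{spr} y = \operatorname{spr}(x^t)\, b^2 + |\operatorname{mid}(x^t)|\, b^3 + \operatorname{spr} \varepsilon. \tag{25}$$

Let $\{(y_j, x_{i,j}) : i = 1, \ldots, k,\ j = 1, \ldots, n\}$ be a simple random
sample obtained from the random intervals $(y, x_1, \ldots, x_k)$ verifying (21).
Then,

$$y = X^M b^1 + X^S b^2 + X^C b^3 + X^R b^4 + \varepsilon,$$

where $y = (y_1, \ldots, y_n)^t$,
$\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^t$ is such that
$E(\varepsilon \mid x) = 1^n \Delta$,
$(X^M)_{j,i} = \operatorname{mid} x_{i,j}\,[1 \pm 0]$ and $X^S$, $X^C$ and $X^R$
are analogously defined. It can be equivalently expressed in matrix form as

$$y = X^{eBl} B + \varepsilon, \tag{26}$$

where $X^{eBl} = (X^M \mid X^S \mid X^C \mid X^R) \in \mathcal{K}_c(\mathbb{R})^{n \times 4k}$
and B is as in (23).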



The LS estimation searches for $\hat B$ and $\hat\Delta$ minimizing
$d_\theta^2(y, X^{eBl} A + 1^n C)$ for $A \in \mathbb{R}^{4k \times 1}$ and
$C \in \mathcal{K}_c(\mathbb{R})$. The constraints to assure the existence of the
residuals are:

$$\operatorname{spr} X\, b^2 + |\operatorname{mid} X|\, b^3 \le \operatorname{spr} y. \tag{27}$$

The estimation of B and $\Delta$ can be solved separately. If $\hat B$ verifies
(27), then the minimum value of $d_\theta^2(y, X^{eBl} \hat B + 1^n C)$ over
$C \in \mathcal{K}_c(\mathbb{R})$ is attained at
$\hat\Delta = \overline{y} -_H \overline{x^{Bl}}\, \hat B$. The objective function
can then be written as

$$d_\theta^2\left(y, X^{eBl} A + 1^n \hat\Delta\right) = (v_m - F_m A_m)^t (v_m - F_m A_m) + \theta\,(v_s - F_s A_s)^t (v_s - F_s A_s),$$

where $v_m, v_s \in \mathbb{R}^n$ and $F_m, F_s \in \mathbb{R}^{n \times 2k}$ are
defined as in (20), $A_m = ((a^1)^t \mid (a^4)^t)^t \in \mathbb{R}^{2k}$ are the
coefficients affecting the mids and
$A_s = ((a^2)^t \mid (a^3)^t)^t \in \mathbb{R}^{2k}$ the coefficients affecting
the spreads, in agreement with (24) and (25), with $a^l \in \mathbb{R}^k$,
$l = 1, \ldots, 4$.

Therefore, the computation of the LS estimator $\hat B$ for the regression
parameter B in (23) is solved through the constrained optimization problem by
KKT conditions:

$$\min_{A_m \in \mathbb{R}^{2k},\ A_s \in \Gamma_2} (v_m - F_m A_m)^t (v_m - F_m A_m) + \theta\,(v_s - F_s A_s)^t (v_s - F_s A_s), \tag{28}$$

with

$$\Gamma_2 = \{(a^2, a^3) \in [0, \infty)^k \times [0, \infty)^k : \operatorname{spr} X\, a^2 + |\operatorname{mid} X|\, a^3 \le \operatorname{spr} y\}. \tag{29}$$

Note that the extension of the linear regression model M developed in [11]
to the multiple case is directly achieved from (21), taking
$b^3 = (0, \ldots, 0)$ and $b^4 = (0, \ldots, 0)$.
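
The separate relationships (24) and (25) and the constraint (27) translate directly into code. The sketch below uses plain Python lists; the function names and the (mid, spr) encoding of $\Delta$ are assumptions of this illustration, and $b^2$, $b^3$ are taken to be non-negative as in the text.

```python
def predict_mflrm(mid_x, spr_x, b1, b2, b3, b4, delta=(0.0, 0.0)):
    """(mid, spr) of the MFLRM regression function for one observation.

    mid_x, spr_x and the coefficient vectors b1..b4 are lists of length k;
    b2 and b3 are assumed non-negative; delta is the (mid, spr) pair of Delta.
    """
    # (24): mids depend on mid x through b1 and on spr x through b4
    mid = delta[0] + sum(m * a + s * d
                         for m, s, a, d in zip(mid_x, spr_x, b1, b4))
    # (25): spreads depend on spr x through b2 and on |mid x| through b3
    spr = delta[1] + sum(s * b + abs(m) * c
                         for m, s, b, c in zip(mid_x, spr_x, b2, b3))
    return mid, spr


def satisfies_27(mid_x_rows, spr_x_rows, spr_y, b2, b3, tol=1e-12):
    """Row-wise check of constraint (27) for a sample of n observations."""
    for mids, sprs, sy in zip(mid_x_rows, spr_x_rows, spr_y):
        lhs = sum(s * b + abs(m) * c for m, s, b, c in zip(mids, sprs, b2, b3))
        if lhs > sy + tol:
            return False
    return True
```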



6 Empirical results 

6.1 Application to a real- life example 

A real-life example concerning the relationship between the daily fluctuations 
of the systolic and diastolic blood pressures and the pulse rate over a sample 
of patients in the Hospital Valle del Nalon, in Spain, is considered (previously 
explored in [TTj [Sj [9]). The metric di/$ is employed, and the optimization 



algorithm quadprog to solve the estimation of the multiple models (13) and
(21) is run.



Let y, $x_1$ and $x_2$ be the fluctuation of the diastolic blood pressure of a
patient over a day, the fluctuation of the systolic blood pressure over the same
day, and the pulse range variation over the same day, respectively. The data in
Table 1 correspond to a sample of 59 patients from $(y, x_1, x_2)$.

From the sample data provided in Table 1, the estimated model (5) for y
and $x_1$ is

$$y = 0.5383\, x_1^M + 0.2641\, x_1^S - 0.4412\, x_1^R + [4.249, 35.254]. \tag{30}$$

The value of the determination coefficient $R^2$ (defined as the proportion of
explained variability) associated with this estimated model is 0.6857.

The estimated model (13) for y and $(x_1, x_2)$ from the data set in Table 1
has the expression:

$$y = 0.4094\, x_1 + 0.0463\, x_2 + [10.3630, 29.5168]. \tag{31}$$

The value of the determination coefficient is in this case $R^2 = 0.4221$.

The linear relationship between y and $(x_1, x_2)$ can also be estimated more
naturally by means of the MFLRM. The estimation of the model (21) leads to
the expression:

$$y = 0.5435\, x_1^M + 0.0190\, x_2^M + 0.2588\, x_1^S + 0.1685\, x_2^S - 0.4446\, x_1^R + 0.1113\, x_2^R + [3.2032, 27.8373], \tag{32}$$

with $R^2 = 0.7922$.



The highest value of $R^2$ is achieved for (32), which agrees with the fact
that the MFLRM is the most flexible regression model among the linear models
that have been developed. The difference in $R^2$ between this multiple model
and the simple one in (30) is due to the inclusion of the pulse rate variable
$x_2$ in the prediction of y. However, this difference is not large, which
indicates that the pulse rate has low explanatory power. The smallest value of
$R^2$ corresponds to (31). It indicates that the multiple basic model is too
restrictive to relate these physical magnitudes.



6.2 Simulation results 

The empirical performance of the regression estimates for each linear model
is investigated by means of some simulations. Three independent random
intervals $x_1, x_2, x_3$ and an interval error $\varepsilon$ will be considered.
Let $\operatorname{mid} x_1 \sim \mathcal{U}(1, 2)$,
$\operatorname{spr} x_1 \sim \mathcal{U}(0, 10)$,
$\operatorname{mid} x_2 \sim \mathcal{N}(2, 1)$,
$\operatorname{spr} x_2 \sim \chi^2$,
$\operatorname{mid} x_3 \sim \mathcal{U}(1, 3)$,
$\operatorname{spr} x_3 \sim \mathcal{U}(0, 5)$,
$\operatorname{mid} \varepsilon \sim \mathcal{N}(0, 1)$ and
$\operatorname{spr} \varepsilon \sim \chi^2$. Different linear expressions
with the investigated structures will be considered.



Model $M_1$: According to the multiple basic linear model presented in
(13), y is defined by the expression:

$$y = 2 x_1 - 5 x_2 - x_3 + \varepsilon. \tag{33}$$

Model $M_2$: A simple linear relationship in terms of the model in (5)
is defined by considering only $x_1$ as independent interval for modelling y
through the expression:

$$y = -2 x_1^M + 2 x_1^S + x_1^C + 0.5\, x_1^R + \varepsilon. \tag{34}$$

Model $M_3$: A multiple flexible linear regression model following (21) is
defined as:

$$y = \cdots + 0.5\, x_1^R + x_2^R - 3 x_3^R + \varepsilon. \tag{35}$$



From each linear model, l = 10,000 random samples have been generated
for different sample sizes n. The estimates of the regression parameters have
been computed for each iteration. Table 2 shows the estimated mean value and
MSE of the LS estimators (denoted globally by $\nu$) computed from the l
iterations. The mean values of the estimates tend to get closer to the
corresponding regression parameters as the sample size n increases, which
empirically suggests the asymptotic unbiasedness of the estimators. Moreover,
the values of the estimated MSE tend to zero as n increases.
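
The two summary quantities reported for each parameter can be computed from the Monte Carlo estimates as in the following sketch. The function name `mc_summary` and the toy numbers are illustrative only and are not taken from the simulation study.

```python
def mc_summary(estimates, true_value):
    """Monte Carlo mean and mean squared error of a list of estimates."""
    n = len(estimates)
    mean = sum(estimates) / n
    mse = sum((e - true_value) ** 2 for e in estimates) / n
    return mean, mse


# Usage with four toy estimates of a parameter whose true value is 2
mean, mse = mc_summary([1.9, 2.1, 2.0, 2.2], 2.0)
# MSE splits into squared bias plus the variance of the estimates
bias_sq = (mean - 2.0) ** 2
variance = sum((e - mean) ** 2 for e in [1.9, 2.1, 2.0, 2.2]) / 4
```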

In Figure 2 the box-plots of the l estimates of the model $M_1$ are presented
for n = 30 (left-side plot) and n = 100 (right-side plot) sample observations.
In all cases the boxes reduce their width around the true value of the
corresponding parameter of the population linear model as the sample size n
increases, which illustrates the consistency. Analogous conclusions are obtained
for the models $M_2$ and $M_3$ in Figures 3 and 4, respectively.

Figure 2: Box plot of the LS estimators for model $M_1$, n = 30 (left); n = 100 (right)

Figure 3: Box plot of the LS estimators for model $M_2$, n = 30 (left); n = 100 (right)

Figure 4: Box plot of the LS estimators for model $M_3$, n = 30 (left); n = 100 (right)

7 Conclusions



Previous linear regression models for interval data based on set arithmetic have
been extended. In all cases the search for the LS estimators involves
minimization problems with constraints. The constraints are necessary to assure
the existence of the residuals and thus the coherency of the estimated model
with the population one.

A very flexible simple model based on the canonical decomposition and
allowing for cross-relationships between mid-points and spreads has been
introduced. An algorithm to find the exact LS estimates has been developed.
This model has been extended to the multiple case. The exact LS algorithm
strongly relies on the geometry of the feasible set and it cannot be generalized
in a simple way. However, the LS estimates can be found by applying the
KKT conditions. The extension of the basic simple model in [9], which is not
based on the canonical decomposition, requires a different approach, but the
solutions can also be found by applying the KKT conditions.

The empirical validity of the estimation process for all the models has
been shown by means of simulations. However, further theoretical studies of
the main properties of the regression estimators, such as the bias, the
consistency or the asymptotic distributions, should be pursued.

Table 1: Sample data for the daily blood pressures and the pulse rate ranges of 59
patients

y        x1       x2       | y        x1       x2       | y        x1       x2
63-102   118-173  58-90    | 47-93    119-212  52-78    | 71-118   104-161  47-68
73-105   122-178  55-84    | 58-113   131-186  32-114   | 74-125   127-189  61-101
62-118   105-157  61-110   | 52-112   113-213  65-92    | 59-94    120-179  62-89
69-133   141-205  38-66    | 48-116   101-194  63-119   | 53-109   99-169   48-73
60-119   109-174  51-95    | 60-98    126-191  59-98    | 76-125   128-210  49-78
55-121   99-201   59-87    | 47-104   94-145   43-67    | 37-94    88-221   49-82
88-130   148-201  55-102   | 55-85    113-183  48-77    | 52-96    111-192  64-107
56-121   94-176   56-133   | 74-133   116-201  54-84    | 50-94    102-156  37-75
39-84    102-167  47-95    | 52-95    103-159  61-94    | 55-98    104-161  56-90
63-118   102-185  44-110   | 45-95    106-167  44-108   | 57-113   111-199  46-83
62-116   112-162  63-109   | 64-121   130-180  52-98    | 67-122   136-201  62-95
55-97    103-161  56-84    | 52-104   90-177   48-107   | 59-101   125-192  54-92
58-109   116-168  26-109   | 54-104   97-182   53-120   | 50-111   98-157   61-108
57-101   124-226  49-88    | 47-108   98-160   54-78    | 59-90    120-180  75-124
60-107   97-154   53-103   | 54-104   100-161  58-99    | 47-86    87-150   47-86
90-127   159-214  59-78    | 77-158   141-256  70-132   | 70-118   138-221  55-89
62-107   108-147  63-115   | 50-95    87-152   55-80    | 65-117   115-196  47-83
53-105   120-188  70-105   | 42-86    99-172   56-103   | 54-100   95-166   40-80
57-95    113-176  71-121   | 45-107   92-172   56-97    | 46-103   114-186  68-91
45-91    83-140   37-86    | 100-136  145-210  62-100   |
Table 2: Experimental results for the estimation of the linear models

                    n = 30              n = 100             n = 500
Model   ν        E(ν)      MSE(ν)    E(ν)      MSE(ν)    E(ν)      MSE(ν)
M_1     b_1      1.9732    0.0042    1.9858    0.0008    1.9933    0.0001
        b_2     -4.9627    0.0056   -4.9799    0.0013   -4.9909    0.0002
        b_3     -0.9809    0.0115   -0.9926    0.0070   -0.9961    0.0001
M_2     b_1     -2.0005    0.0097   -1.9997    0.0026   -2.0004    0.0005
        b_2      1.9651    0.0052    1.9809    0.0013    1.9911    0.0003
        b_3      0.9302    0.0230    0.9588    0.0060    0.9816    0.0011
        b_4      0.4991    0.0044    0.5004    0.0012    0.5000    0.0002
M_3     b^1_1   -2.0014    0.0114   -2.0004    0.0026   -2.0002    0.0005
        b^1_2    5.0017    0.0465    5.0007    0.0108    5.0001    0.0020
        b^1_3    1.0002    0.0111    1.0001    0.0027    1.0000    0.0005
        b^2_1    1.9738    0.0082    1.9837    0.0019    1.9920    0.0003
        b^2_2    1.9763    0.0100    1.9853    0.0020    1.9920    0.0004
        b^2_3    0.9722    0.0082    0.9841    0.0018    0.9918    0.0003
        b^3_1    0.9576    0.0413    0.9691    0.0090    0.9855    0.0015
        b^3_2    0.9097    0.0737    0.9429    0.0171    0.9717    0.0030
        b^3_3    2.9588    0.0410    2.9709    0.0087    2.9842    0.0015
        b^4_1    0.4996    0.0054    0.5003    0.0013    0.5001    0.0002
        b^4_2    0.9992    0.0060    1.0002    0.0014    1.0002    0.0003
        b^4_3   -2.9994    0.0053   -2.9995    0.0013   -3.0002    0.0003
List of Figure Captions

• Figure 1: Feasible set $\Gamma_G$ for the sample data in Table 1

• Figure 2: Box plot of the LS estimators for model $M_1$, n = 30 (left);
n = 100 (right)

• Figure 3: Box plot of the LS estimators for model $M_2$, n = 30 (left);
n = 100 (right)

• Figure 4: Box plot of the LS estimators for model $M_3$, n = 30 (left);
n = 100 (right)
Algorithm 1

STEP 1: Compute the global minimum of g, $\nu = S_2^{-1} z_2$, with
$z_2 = (\hat\sigma_{x^S, y}, \hat\sigma_{x^C, y})^t$ and $S_2$ the sample
covariance matrix of $(x^S, x^C)$.
If $\nu \in \Gamma_G$, then $\nu$ is the solution, else goto Step 2.

STEP 2: Compute $r_0 = \min_{i=1,\ldots,n} \operatorname{spr} y_i / |\operatorname{mid} x_i|$
and identify the straight line $l_{(v)}$ in the set
$\{l_k : c = -u_k b + v_k\}_{k=1}^n$ such that $(0, r_0) \in l_{(v)}$. If there
exists more than one line in these conditions, then $l_{(v)}$ is the one for
which the slope $-u_k = -\operatorname{spr} x_k / |\operatorname{mid} x_k|$ is
lowest (all such lines share the intercept $v_k = r_0$, so only the slope
distinguishes them).

STEP 3: Compute $s_0 = \min_{j=1,\ldots,n} \operatorname{spr} y_j / \operatorname{spr} x_j$
and identify the straight line $l_{(h)}$ in the set
$\{l_k : c = -u_k b + v_k\}_{k=1}^n$ such that $(s_0, 0) \in l_{(h)}$. If there
exists more than one line in these conditions, then $l_{(h)}$ is the one for
which the value $-\operatorname{spr} y_k / |\operatorname{mid} x_k|$ is greatest.

STEP 4: Let $R = \{l_{(v)}\}$, $C = \{0, s_0\}$, $D = \{(v), (h)\}$, $j = 1$ and
$l_{(j)} = l_{(v)}$.
If $(v) = (h)$, then redefine $R = \{l^1\}$, $C = \{x^0, x^1\}$, let $t = 1$ and
goto Step 8, else goto Step 5.

STEP 5: Compute $(b_{(j,h)}, c_{(j,h)})$, the intersection point of the lines
$l_{(j)}$ and $l_{(h)}$.
Check if $(b_{(j,h)}, c_{(j,h)}) \in \operatorname{fr}(\Gamma_G)$ through the conditions
i) $b_{(j,h)} \in [0, s_0]$, and
ii) $c_{(j,h)} = \min\{-u_k b_{(j,h)} + v_k : k = 1, \ldots, n\}$.
If $(b_{(j,h)}, c_{(j,h)}) \in \operatorname{fr}(\Gamma_G)$, goto Step 7, else goto Step 6.

STEP 6: Compute $(b_{(j,k)}, c_{(j,k)})$, the intersection points of $l_{(j)}$ and
each line in $\{l_k : c = -u_k b + v_k\}_{k=1}^n$ such that $k \notin D$. Take the
line $l_{k^*}$ such that $(b_{(j,k^*)}, c_{(j,k^*)}) \in \operatorname{fr}(\Gamma_G)$
(verifying the corresponding conditions i) and ii) shown in Step 5). If there
exists more than one line in these conditions, choose as $l_{k^*}$ the one for
which the value $-\operatorname{spr} y_{k^*} / |\operatorname{mid} x_{k^*}|$ is lowest.
Let $R = R \cup \{l_{k^*}\}$, $C = C \cup \{b_{(j,k^*)}\}$, $D = D \cup \{k^*\}$,
$j = j + 1$, $l_{(j)} = l_{k^*}$, and goto Step 5.

STEP 7: Let $R = R \cup \{l_{(h)}\}$ and $C = C \cup \{b_{(j,h)}\}$.
Redefine $R = \{l_{(v)}, l_{k_1^*}, l_{k_2^*}, \ldots, l_{k_j^*}, l_{(h)}\}$ as
$\{l^1, l^2, l^3, \ldots, l^{t-1}, l^t\}$, and
$C = \{0, b_{(1,k_1^*)}, b_{(k_1^*,k_2^*)}, \ldots, b_{(k_j^*,h)}, s_0\}$ as
$\{x^0, x^1, x^2, \ldots, x^{t-1}, x^t\}$. Goto Step 8.

STEP 8: For $i = 1, \ldots, t$, compute the local minimum of g over the segment
corresponding to the line $l^i$ on $[x^{i-1}, x^i]$, given by the analytic expressions

$$b_i^* = \max\{x^{i-1}, \min\{b_i', x^i\}\}, \qquad c_i^* = -u_i b_i^* + v_i,$$

where

$$b_i' = \frac{u_i v_i \hat\sigma_{x^C}^2 - v_i \hat\sigma_{x^S, x^C} - u_i \hat\sigma_{x^C, y} + \hat\sigma_{x^S, y}}{\hat\sigma_{x^S}^2 + u_i^2 \hat\sigma_{x^C}^2 - 2 u_i \hat\sigma_{x^S, x^C}}.$$

Compute $g(b_i^*, c_i^*)$.
Take $(b_{L_2}, c_{L_2})$ the point in $\{(b_i^*, c_i^*)\}_{i=1}^t$ for which the
value $g(b_i^*, c_i^*)$ is lowest.
Note that $(b_{L_2}, c_{L_2})$ is the local minimum of g over $L_2$.

STEP 9: Compute $(b_{L_1}, c_{L_1})$, the local minimum of g over $L_1$, given by
the analytic expressions

$$b_{L_1} = 0, \qquad c_{L_1} = \max\left\{0, \min\left\{\frac{\hat\sigma_{x^C, y}}{\hat\sigma_{x^C}^2}, r_0\right\}\right\}.$$

Compute $g(b_{L_1}, c_{L_1})$.

STEP 10: Compute $(b_{L_3}, c_{L_3})$, the local minimum of g over $L_3$, given by
the analytic expressions

$$b_{L_3} = \max\left\{0, \min\left\{\frac{\hat\sigma_{x^S, y}}{\hat\sigma_{x^S}^2}, s_0\right\}\right\}, \qquad c_{L_3} = 0.$$

Compute $g(b_{L_3}, c_{L_3})$.

STEP 11: Take $(b^*, c^*)$ the point in $\{(b_{L_j}, c_{L_j})\}_{j=1}^3$ whose
value $g(b_{L_j}, c_{L_j})$ is lowest. Note that $(b^*, c^*)$ is the local minimum
of g on $\operatorname{fr}(\Gamma_G)$.
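
The closed-form minimizer used in Step 8 follows from setting the derivative of $g(b, -u b + v)$ with respect to b to zero and clipping the result to the segment. The sketch below implements that single step only; the function name and the dictionary keys (`ss`, `cc`, `sc`, `sy`, `cy`, standing for the five moments of g) are hypothetical, and strict convexity along the line (positive denominator) is assumed.

```python
def segment_minimizer(u, v, lo, hi, m):
    """Minimize g along the line c = -u*b + v for b in [lo, hi] (Step 8)."""
    denom = m["ss"] + u * u * m["cc"] - 2 * u * m["sc"]
    numer = u * v * m["cc"] - v * m["sc"] - u * m["cy"] + m["sy"]
    b_free = numer / denom                 # stationary point along the line
    b_star = max(lo, min(b_free, hi))      # clip to the segment
    return b_star, -u * b_star + v


def g(b, c, m):
    """Objective function g(b, c) written with the same moment dictionary."""
    return (b * b * m["ss"] + c * c * m["cc"] + 2 * b * c * m["sc"]
            - 2 * b * m["sy"] - 2 * c * m["cy"])
```

Steps 9 and 10 are the same computation on the two axes, where the line reduces to b = 0 or c = 0.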



